scene text
TextInPlace: Indoor Visual Place Recognition in Repetitive Structures with Scene Text Spotting and Verification
Tao, Huaqi, Liu, Bingxi, Chen, Calvin, Huang, Tingjun, Li, He, Cui, Jinqiang, Zhang, Hong
Visual Place Recognition (VPR) is a crucial capability for long-term autonomous robots, enabling them to identify previously visited locations using visual information. However, existing methods remain limited in indoor settings due to the highly repetitive structures inherent in such environments. We observe that scene text typically appears in indoor spaces, serving to distinguish visually similar but different places. This inspires us to propose TextInPlace, a simple yet effective VPR framework that integrates Scene Text Spotting (STS) to mitigate visual perceptual ambiguity in repetitive indoor environments. Specifically, TextInPlace adopts a dual-branch architecture within a local parameter sharing network. The VPR branch employs attention-based aggregation to extract global descriptors for coarse-grained retrieval, while the STS branch utilizes a bridging text spotter to detect and recognize scene text. Finally, the discriminative text is filtered to compute text similarity and re-rank the top-K retrieved images. To bridge the gap between current text-based repetitive indoor scene datasets and the typical scenarios encountered in robot navigation, we establish an indoor VPR benchmark dataset, called Maze-with-Text. Extensive experiments on both custom and public datasets demonstrate that TextInPlace achieves superior performance over existing methods that rely solely on appearance information. The dataset, code, and trained models are publicly available at https://github.com/HqiTao/TextInPlace.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Asia > China > Guangdong Province > Shenzhen (0.04)
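The re-ranking step in the TextInPlace abstract above (filtering discriminative spotted text and using text similarity to reorder the top-K retrieval results) can be illustrated with a minimal sketch. The fusion weight, scoring function, and dictionary fields below are illustrative assumptions, not the paper's exact formulation.

```python
from difflib import SequenceMatcher

def text_similarity(query_texts, cand_texts):
    """Best-match similarity between two sets of spotted strings (illustrative metric)."""
    if not query_texts or not cand_texts:
        return 0.0
    scores = [max(SequenceMatcher(None, q.lower(), c.lower()).ratio()
                  for c in cand_texts) for q in query_texts]
    return sum(scores) / len(scores)

def rerank_topk(candidates, query_texts, alpha=0.5):
    """Re-rank top-K candidates by fusing the global-descriptor score with
    spotted-text similarity; `alpha` is an assumed fusion weight."""
    rescored = []
    for cand in candidates:  # each: {"id", "global_score", "texts"}
        t = text_similarity(query_texts, cand["texts"])
        rescored.append((alpha * cand["global_score"] + (1 - alpha) * t, cand["id"]))
    return [cid for _, cid in sorted(rescored, reverse=True)]

# Example: two visually similar corridors distinguished only by their room signs.
topk = [{"id": "img_A", "global_score": 0.92, "texts": ["ROOM 301"]},
        {"id": "img_B", "global_score": 0.91, "texts": ["ROOM 417"]}]
print(rerank_topk(topk, query_texts=["ROOM 417"]))  # -> ['img_B', 'img_A']
```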
VQA-Levels: A Hierarchical Approach for Classifying Questions in VQA
Madaka, Madhuri Latha, Bhagvati, Chakravarthy
Designing datasets for Visual Question Answering (VQA) is a difficult and complex task that requires NLP for parsing the question and computer vision for analysing the aspects of the image relevant to answering it. Several benchmark datasets have been developed by researchers, but there are many issues with using them for methodical performance tests. This paper proposes a new benchmark dataset -- a pilot version, called VQA-Levels, is ready now -- for testing VQA systems systematically and assisting researchers in advancing the field. The questions are classified into seven levels, ranging from direct answers based on low-level image features (without needing even a classifier) to those requiring high-level abstraction of the entire image content. Each question in the dataset exhibits one or more of ten properties and is categorised into a specific level from 1 to 7. Levels 1-3 are based directly on the visual content, while the remaining levels require extra knowledge about the objects in the image. Each question generally has a unique one- or two-word answer. The questions are 'natural' in the sense that a human is likely to ask such a question when seeing the image. An example question at Level 1 is, "What is the shape of the red colored region in the image?", while at Level 7 it is, "Why is the man cutting the paper?". Initial testing of the proposed dataset on some existing VQA systems reveals that their success is highest on Level 1 (low-level features) and Level 2 (object classification) questions, and lowest on Level 3 (scene text), followed by Level 6 (extrapolation) and Level 7 (whole-scene analysis) questions. The work in this paper will go a long way toward the systematic analysis of VQA systems.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > Middle East > UAE > Dubai Emirate > Dubai (0.04)
- Asia > India > Telangana > Hyderabad (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.36)
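The seven-level taxonomy described in the VQA-Levels abstract is essentially a labeled question schema. The sketch below is a hypothetical record layout (field names, property labels, and the example answers are assumptions for illustration; only the two example questions come from the abstract).

```python
from dataclasses import dataclass, field

@dataclass
class VQAQuestion:
    """Hypothetical record layout for a VQA-Levels entry (field names assumed)."""
    image_id: str
    question: str
    answer: str            # typically a unique one- or two-word answer
    level: int             # 1-7, from low-level features to whole-scene reasoning
    properties: list = field(default_factory=list)  # subset of the ten properties

# Illustrative entries; answers and property labels are made up for the example.
examples = [
    VQAQuestion("img_001", "What is the shape of the red colored region in the image?",
                "circle", level=1, properties=["low-level-feature"]),
    VQAQuestion("img_002", "Why is the man cutting the paper?",
                "craft", level=7, properties=["whole-scene", "external-knowledge"]),
]
assert all(1 <= q.level <= 7 for q in examples)
```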
InstructOCR: Instruction Boosting Scene Text Spotting
Duan, Chen, Jiang, Qianyi, Fu, Pei, Chen, Jiamin, Li, Shengxi, Wang, Zining, Guo, Shan, Luo, Junfeng
In the field of scene text spotting, previous OCR methods primarily relied on image encoders and pre-trained text information, but they often overlooked the advantages of incorporating human language instructions. To address this gap, we propose InstructOCR, an innovative instruction-based scene text spotting model that leverages human language instructions to enhance the understanding of text within images. Our framework employs both text and image encoders during training and inference, along with instructions meticulously designed based on text attributes. This approach enables the model to interpret text more accurately and flexibly. Extensive experiments demonstrate the effectiveness of our model, and we achieve state-of-the-art results on widely used benchmarks. Furthermore, the proposed framework can be seamlessly applied to scene text VQA tasks. By leveraging instruction strategies during pre-training, the performance on downstream VQA tasks can be significantly improved, with a 2.6% increase on the TextVQA dataset and a 2.1% increase on the ST-VQA dataset. These experimental results provide insights into the benefits of incorporating human language instructions for OCR-related tasks.
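The abstract describes instructions "meticulously designed based on text attributes" without giving the templates. The sketch below only illustrates the general idea of turning simple text attributes into natural-language instructions; the attribute names and template wording are assumptions, not InstructOCR's actual prompts.

```python
def build_instruction(attributes):
    """Turn simple text attributes into a natural-language instruction.
    Attribute keys and templates are assumptions for illustration only."""
    parts = []
    if "length" in attributes:
        parts.append(f"the text has {attributes['length']} characters")
    if "case" in attributes:
        parts.append(f"the text is in {attributes['case']} case")
    if "content_type" in attributes:
        parts.append(f"the text contains {attributes['content_type']}")
    return "Spot the text where " + " and ".join(parts) + "."

print(build_instruction({"length": 4, "case": "upper", "content_type": "digits"}))
# -> "Spot the text where the text has 4 characters and the text is in upper case and ..."
```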
ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering
Nguyen, Nghia Hieu, Quan, Tho Thanh, Nguyen, Ngan Luu-Thuy
Text-based VQA is a challenging task that requires machines to use scene texts in given images to yield the most appropriate answer for the given question. The main challenge of text-based VQA is exploiting the meaning and information from scene texts. Recent studies tackled this challenge by considering the spatial information of scene texts in images via embedding 2D coordinates of their bounding boxes. In this study, we follow the definition of meaning from linguistics to introduce a novel method that effectively exploits the information from scene texts written in Vietnamese. Experimental results show that our proposed method obtains state-of-the-art results on two large-scale Vietnamese Text-based VQA datasets. The implementation can be found at this link.
- Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.82)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.52)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.50)
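A common way to inject the spatial information of scene-text bounding boxes mentioned in the ViConsFormer abstract is to normalize the 2D box coordinates and project them into the token-embedding space. The sketch below shows that generic idea with a random stand-in projection; it is not ViConsFormer's exact formulation.

```python
import numpy as np

def box_position_embedding(boxes, img_w, img_h, dim=64, rng=None):
    """Project normalized (x1, y1, x2, y2) boxes into a `dim`-dimensional embedding.
    A fixed random matrix stands in for a learned projection layer."""
    rng = rng or np.random.default_rng(0)
    W = rng.standard_normal((4, dim)) / np.sqrt(4)
    boxes = np.asarray(boxes, dtype=np.float64)
    norm = boxes / np.array([img_w, img_h, img_w, img_h])  # coordinates in [0, 1]
    return norm @ W                                        # one embedding per OCR token

# Two OCR tokens with pixel-space boxes in a 640x480 image.
emb = box_position_embedding([[32, 40, 120, 70], [300, 200, 420, 240]], 640, 480)
print(emb.shape)  # (2, 64)
```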
Robust Loop Closure by Textual Cues in Challenging Environments
Jin, Tongxing, Nguyen, Thien-Minh, Xu, Xinhang, Yang, Yizhuo, Yuan, Shenghai, Li, Jianping, Xie, Lihua
Loop closure is an important task in robot navigation. However, existing methods mostly rely on implicit or heuristic features of the environment, which can still fail in common environments such as corridors, tunnels, and warehouses. Indeed, navigating such featureless, degenerative, and repetitive (FDR) environments would pose a significant challenge even for humans, but explicit text cues in the surroundings often provide the best assistance. This inspires us to propose a multi-modal loop closure method based on explicit human-readable textual cues in FDR environments. Specifically, our approach first extracts scene text entities based on Optical Character Recognition (OCR), then creates a local map of text cues based on accurate LiDAR odometry, and finally identifies loop closure events by a graph-theoretic scheme. Experimental results demonstrate that this approach has superior performance over existing methods that rely solely on visual and LiDAR sensors. To benefit the community, we release the source code and datasets at https://github.com/TongxingJin/TXTLCD.
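At its simplest, the pipeline above (OCR, then a local map of text cues, then graph-theoretic matching) can be approximated by comparing the sets of text entities observed near two poses. The overlap score and threshold below are assumptions, and the actual method additionally uses LiDAR odometry and a graph formulation.

```python
def text_cue_overlap(texts_a, texts_b):
    """Jaccard overlap between two sets of spotted text entities (a simplification
    of the graph-theoretic matching described in the abstract)."""
    a, b = {t.upper() for t in texts_a}, {t.upper() for t in texts_b}
    return len(a & b) / len(a | b) if a | b else 0.0

def detect_loop(current_texts, keyframe_db, threshold=0.5):
    """Return keyframe ids whose local text map overlaps the current one."""
    return [kf_id for kf_id, texts in keyframe_db.items()
            if text_cue_overlap(current_texts, texts) >= threshold]

db = {"kf_12": ["EXIT", "B2-07", "STORAGE"], "kf_85": ["EXIT", "B3-01"]}
print(detect_loop(["B2-07", "EXIT"], db))  # -> ['kf_12']
```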
Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval
Zeng, Gangyan, Zhang, Yuan, Wei, Jin, Yang, Dongbao, Zhang, Peng, Gao, Yiwen, Qin, Xugong, Zhou, Yu
Scene text retrieval aims to find all images containing the query text from an image gallery. Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which requires complicated text detection and/or recognition processes, resulting in inefficient and inflexible retrieval. Different from them, in this work we propose to explore the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval. Through empirical analysis, we observe that the main challenges of CLIP as a text retriever are: 1) limited text perceptual scale, and 2) entangled visual-semantic concepts. To this end, a novel model termed FDP (Focus, Distinguish, and Prompt) is developed. FDP first focuses on scene text via shifting the attention to the text area and probing the hidden text knowledge, and then divides the query text into content word and function word for processing, in which a semantic-aware prompting scheme and a distracted queries assistance module are utilized. Extensive experiments show that FDP significantly enhances the inference speed while achieving better or competitive retrieval accuracy compared to existing methods. Notably, on the IIIT-STR benchmark, FDP surpasses the state-of-the-art model by 4.37% with a 4 times faster speed. Furthermore, additional experiments under phrase-level and attribute-aware scene text retrieval settings validate FDP's particular advantages in handling diverse forms of query text. The source code will be publicly available at https://github.com/Gyann-z/FDP.
- Oceania > Australia > Victoria > Melbourne (0.15)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.54)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.49)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.48)
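At its core, OCR-free scene text retrieval with CLIP ranks gallery images by the similarity between the query-text embedding and each image embedding. The sketch below shows that baseline ranking step on precomputed (here, toy) embeddings; it omits FDP's attention-shifting, content/function-word split, and prompting modules.

```python
import numpy as np

def rank_gallery(text_emb, image_embs):
    """Rank gallery images by cosine similarity to the query-text embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    g = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = g @ t
    order = np.argsort(-sims)
    return order, sims[order]

# Toy 4-dimensional embeddings standing in for CLIP outputs.
query = np.array([0.1, 0.9, 0.2, 0.0])
gallery = np.array([[0.0, 1.0, 0.1, 0.0],   # image containing the query text
                    [1.0, 0.0, 0.0, 0.3]])  # image without it
print(rank_gallery(query, gallery)[0])  # -> [0 1]
```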
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Van Nguyen, Quan, Tran, Dan Quang, Pham, Huy Quang, Nguyen, Thang Kien-Bao, Nguyen, Nghia Hieu, Van Nguyen, Kiet, Nguyen, Ngan Luu-Thuy
Visual Question Answering (VQA) is a complicated task that requires the capability of simultaneously processing natural language and images. Early research on this task focused on methods that help machines understand objects and scene context in images. However, text appearing in images, which often carries explicit information about the full content of an image, was not addressed. With the continuous development of AI, many studies worldwide have examined the reading-comprehension ability of VQA models; in Vietnam, a developing country, resources remain limited and this task is still open. Therefore, we introduce the first large-scale dataset in Vietnamese specializing in the ability to understand text appearing in images, which we call ViTextVQA (Vietnamese Text-based Visual Question Answering dataset); it contains over 16,000 images and over 50,000 questions with answers. Through meticulous experiments with various state-of-the-art models, we uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers. This finding helped us significantly improve the performance of the baseline models on the ViTextVQA dataset. Our dataset is available at https://github.com/minhquan6203/ViTextVQA-Dataset for research purposes.
- Asia > Vietnam > Hanoi > Hanoi (0.14)
- Asia > Vietnam > Vĩnh Long Province > Vĩnh Long (0.04)
- Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
- (7 more...)
- Research Report > Promising Solution (1.00)
- Overview (1.00)
- Education (0.66)
- Health & Medicine (0.47)
- Information Technology (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
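The ViTextVQA finding about the order in which OCR tokens are processed can be illustrated by a standard reading-order sort (top-to-bottom, then left-to-right). The row-grouping tolerance and token fields below are assumptions, not values from the paper.

```python
def reading_order(ocr_tokens, row_tol=10):
    """Sort OCR tokens top-to-bottom, then left-to-right.
    `row_tol` (pixels) groups tokens whose vertical centers are close;
    it is an illustrative assumption."""
    if not ocr_tokens:
        return []
    tokens = sorted(ocr_tokens, key=lambda t: t["cy"])
    rows, current = [], [tokens[0]]
    for tok in tokens[1:]:
        if abs(tok["cy"] - current[-1]["cy"]) <= row_tol:
            current.append(tok)
        else:
            rows.append(current)
            current = [tok]
    rows.append(current)
    ordered = [t for row in rows for t in sorted(row, key=lambda t: t["cx"])]
    return [t["text"] for t in ordered]

tokens = [{"text": "PHO", "cx": 40, "cy": 22}, {"text": "24", "cx": 120, "cy": 20},
          {"text": "OPEN", "cx": 60, "cy": 80}]
print(reading_order(tokens))  # -> ['PHO', '24', 'OPEN']
```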
Ensemble Learning for Vietnamese Scene Text Spotting in Urban Environments
Nguyen, Hieu, Ta, Cong-Hoang, Le-Nguyen, Phuong-Thuy, Tran, Minh-Triet, Le, Trung-Nghia
This paper presents a simple yet efficient ensemble learning framework for Vietnamese scene text spotting. Leveraging the power of ensemble learning, which combines multiple models to yield more accurate predictions, our approach aims to significantly enhance the performance of scene text spotting in challenging urban settings. Through experimental evaluations on the VinText dataset, our proposed method achieves a significant improvement over existing methods, with an impressive accuracy gain of 5%. These results unequivocally demonstrate the efficacy of ensemble learning in the context of Vietnamese scene text spotting in urban environments, highlighting its potential for real-world applications, such as text detection and recognition in urban signage, advertisements, and various text-rich urban scenes.
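One simple way to combine detections from multiple spotters, consistent with the ensemble idea above though not necessarily the authors' exact scheme, is to cluster overlapping boxes across models and keep a consensus prediction. The IoU threshold and vote count below are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def ensemble_spotting(per_model_dets, iou_thr=0.5, min_votes=2):
    """Keep detections that at least `min_votes` models agree on (by IoU),
    taking the highest-confidence transcription among the agreeing models."""
    flat = [d for dets in per_model_dets for d in dets]  # {"box", "text", "score"}
    flat.sort(key=lambda d: -d["score"])
    kept, used = [], [False] * len(flat)
    for i, d in enumerate(flat):
        if used[i]:
            continue
        group = [d]
        for j in range(i + 1, len(flat)):
            if not used[j] and iou(d["box"], flat[j]["box"]) >= iou_thr:
                used[j] = True
                group.append(flat[j])
        if len(group) >= min_votes:
            kept.append(max(group, key=lambda g: g["score"]))
    return kept

model_a = [{"box": (10, 10, 90, 40), "text": "PHỞ", "score": 0.9}]
model_b = [{"box": (12, 11, 88, 42), "text": "PHỞ", "score": 0.8},
           {"box": (200, 50, 260, 80), "text": "BÚN", "score": 0.4}]
print(ensemble_spotting([model_a, model_b]))  # only the box both models found
```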
InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models
Wen, Bingbing, Yang, Zhengyuan, Wang, Jianfeng, Gan, Zhe, Howe, Bill, Wang, Lijuan
In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich, informative answers in each round, even drawing on external knowledge related to the visual content. Different from existing datasets where the answer is compact and short, InfoVisDial contains long free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to bridge a large-scale multimodal model (e.g., GIT) and a language model (e.g., GPT-3). GIT can describe the image content, including scene text, while GPT-3 can generate informative dialogue based on the image description and appropriate prompting techniques. With such an automatic pipeline, we can readily generate informative visual dialogue data at scale. Then, we ask human annotators to rate the generated dialogues to filter out low-quality conversations. Human analyses show that InfoVisDial covers informative and diverse dialogue topics: 54.4% of the dialogue rounds are related to image scene text, and 36.7% require external knowledge. Each round's answer is also long and open-ended: 87.3% of answers are unique, with an average length of 8.9, compared with 27.37% and 2.9 in VisDial. Last, we propose a strong baseline by adapting the GIT model to the visual dialogue task and fine-tuning it on InfoVisDial. Hopefully, our work can motivate more effort in this direction.
- North America > United States > California (0.04)
- North America > Dominican Republic (0.04)
- North America > United States > Colorado (0.04)
- (3 more...)
- Research Report (0.64)
- Questionnaire & Opinion Survey (0.46)
SPTS v2: Single-Point Scene Text Spotting
Liu, Yuliang, Zhang, Jiaxin, Peng, Dezhi, Huang, Mingxin, Wang, Xinyu, Tang, Jingqun, Huang, Can, Lin, Dahua, Shen, Chunhua, Bai, Xiang, Jin, Lianwen
End-to-end scene text spotting has made significant progress due to the intrinsic synergy between text detection and recognition. Previous methods commonly regard manual annotations such as horizontal rectangles, rotated rectangles, quadrangles, and polygons as a prerequisite, which are much more expensive than single-point annotations. Our new framework, SPTS v2, allows us to train high-performing text-spotting models using a single-point annotation. SPTS v2 retains the advantages of the auto-regressive Transformer through an Instance Assignment Decoder (IAD) that sequentially predicts the center points of all text instances within a single sequence, while a Parallel Recognition Decoder (PRD) performs text recognition in parallel, significantly reducing the required sequence length. The two decoders share the same parameters and are interactively connected by a simple but effective information-transmission process that passes gradients and information between them. Comprehensive experiments on various existing benchmark datasets demonstrate that SPTS v2 can outperform previous state-of-the-art single-point text spotters with fewer parameters while achieving 19 times faster inference speed. Within the SPTS v2 framework, our experiments suggest a potential preference for the single-point representation in scene text spotting compared to other representations. Such an attempt provides a significant opportunity for scene text spotting applications beyond the realms of existing paradigms.
- Asia > China > Hong Kong (0.04)
- North America > Canada > British Columbia (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- (2 more...)
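Single-point supervision, as described in the SPTS v2 abstract, turns each text instance into a short token sequence: a discretized center point plus the transcription. The sketch below shows one plausible way to build such a target sequence; the bin count and token layout are assumptions, not the paper's exact tokenization.

```python
def build_point_sequence(instances, img_w, img_h, num_bins=1000):
    """Serialize text instances as [x_bin, y_bin, text] tokens, mirroring the idea
    of predicting one center point per instance. Bin count is an assumption."""
    seq = []
    for inst in instances:  # {"cx", "cy", "text"} with pixel-space center points
        x_bin = min(int(inst["cx"] / img_w * num_bins), num_bins - 1)
        y_bin = min(int(inst["cy"] / img_h * num_bins), num_bins - 1)
        seq.append([x_bin, y_bin, inst["text"]])
    return seq

instances = [{"cx": 320, "cy": 48, "text": "EXIT"},
             {"cx": 96, "cy": 400, "text": "B2"}]
print(build_point_sequence(instances, img_w=640, img_h=480))
# -> [[500, 100, 'EXIT'], [150, 833, 'B2']]
```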